1 Summary

The goal of this analysis was to investigate differential expression in mouse long- and short-term hematopoietic stem cells (HSCs) with and without knockout of BCAP. Raw RNA-seq data were previously processed and aligned to the NCBI37/mm9 reference genome with Bowtie and TopHat. Read counts were generated using htseq-count.

When controlling for HSC population (hscPop: long- vs. short-term) and RNA concentration (ng_per_ul), 274 genes were found to be significantly differentially expressed between knockout groups (koStatus: WT vs. BCAP KO); 57 of these genes showed an absolute fold change greater than 2 between groups.


2 R environment set-up

2.1 Loading packages

I first loaded required packages for data munging, visualization, and analysis (these are largely Hadley Wickham libraries, plus some Bioconductor tools).

rm(list = ls())
cleanup <- gc(verbose = FALSE)

# Load libraries I'll need here
library(edgeR)
library(limma)
library(biomaRt)
library(ggbiplot)
library(cowplot)
library(sva)

# Load my go-to libraries
library(dplyr)
library(ggplot2)
library(ggthemes)
library(stringr)
library(readr)
library(readxl)
library(reshape2)

# Packages for R markdown stuff
library(knitr)
library(shiny)

 

2.2 Defining functions

Functions for plotting metrics are contained in metric_qc_functions.R.

source("R/metric_qc_functions.R")

This is a function written by Elizabeth Whalen (shared by Michael Mason) that might come in handy with some steps of the analysis. I modified the function slightly, such that library sizes are updated and normalization factors are calculated after filtering genes. I also added the option to input gene annotation information.

# Function to build a DGEList object, filter genes (keeping only those with at
# least 'filterCount' CPM in at least 'filterPercentage' of samples), and
# compute normalization factors from library sizes
setUpDGEList <- function(countData, geneData = NULL, 
                         filterCount = 1, filterPercentage = 0.1)
{
    d <- DGEList(counts = countData, genes = geneData)
    # d <- calcNormFactors(d) # moved further down
    
    # Filter all genes that do not have at least 'filterCount' counts per 
    # million in at least 'filterPercentage' percent of libraries
    keepRows <- rowSums(round(cpm(d$counts)) >= filterCount) >= 
        filterPercentage*ncol(countData)
    print(table(keepRows))

    curDGE <- d[keepRows, ]
    
    # James: I've added this change so that library sizes and normalization
    # factors will always be updated/calculated after filtering genes
    
    # reset library sizes
    curDGE$samples$lib.size <- colSums(curDGE$counts)
    
    # calculate normalization factors (effective library size = 
    # lib.size * norm.factor)
    curDGE <- calcNormFactors(curDGE)
    return(curDGE)
}

 

2.3 Loading data

Next, I read counts and metrics data for the project into R, along with sample annotation for project libraries.

# Read CSV file with read counts
countFile <- "data/HMMF2ADXX_combined_counts.csv"
countDat <- read_csv(countFile) # 37991 obs. of  18 variables
# str(countDat)

# Read CSV file with RNAseq/alignment metrics
metricFile <- "data/HMMF2ADXX_combined_metrics.csv"
metricDat <- read_csv(metricFile) # 16 obs. of  71 variables
# str(metricDat)

# Read XLSX file with sample annotation
designFile <- "data/JMD119 Sample Information .xlsx"
designDat <- read_excel(designFile, skip = 1) # 36 obs. of 18 variables
# str(designDat)

 

2.4 Cleaning data

I needed to do a bit of cleaning/formatting with variable names (column headers) to make life easier and avoid breaking downstream functions.

# Separate gene counts and gene symbols into separate objects, reformat
# variable names in countDat to only include libID
geneDat <- data_frame(ensemblID = countDat$geneName)
countDat <- countDat %>% 
    select(-geneName)
names(countDat) <- names(countDat) %>% 
    str_extract("lib[0-9]+")

# Reformat variable names in metrics data frame
names(metricDat) <- names(metricDat) %>% 
    str_to_lower() %>%  # change variable names to lower case
    make.unique(sep = "_") # de-dup variable names
names(metricDat)[1] <- "lib_id" # reformat libID variable name

# Reformat row names in metrics dataframe
metricDat <- metricDat %>% 
    mutate(lib_id = str_extract(lib_id, "lib[0-9]+"))

# Reformat variable names in design data frame
names(designDat) <- names(designDat) %>% 
    str_replace_all(" +", "_") %>% # replace spaces with underscores
    str_replace_all("#", "num") %>%  # replace # with 'num'
    str_replace_all("/", "_per_") %>% 
    str_replace_all("(\\(|\\))", "") %>% # remove parentheses
    str_to_lower() %>% # change to lower case
    str_replace("(?<=(lib))[a-z]+", "") %>% # replace 'library' with 'lib'
    make.unique(sep = "_") # de-dup variable names

# Remove empty rows from design data frame
designDat <- designDat %>% 
    filter(!is.na(lib_id))

 

I created a new object to store the salient information about groups in the study I want to compare.

groupDat <- designDat %>% 
    # extract knockout status (WT or BCAP) and HSC population (long- or
    # short-term); combine into a single group vector
    mutate(koStatus = as.factor(tolower(str_extract(sample_name, "WT|BCAP"))),
           hscPop = as.factor(tolower(str_extract(hsc_population, 
                                                  "Long|Short"))),
           group = as.factor(str_c(koStatus, hscPop, sep = "_"))) %>% 
    select(libID = lib_id,
           koStatus, hscPop, group)

For reference, here are the relevant groups in the data (stored in groupDat):

libID koStatus hscPop group
lib7418 wt long wt_long
lib7419 wt long wt_long
lib7420 wt long wt_long
lib7421 wt long wt_long
lib7422 bcap long bcap_long
lib7423 bcap long bcap_long
lib7424 bcap long bcap_long
lib7425 bcap long bcap_long
lib7426 wt short wt_short
lib7427 wt short wt_short
lib7428 wt short wt_short
lib7429 wt short wt_short
lib7430 bcap short bcap_short
lib7431 bcap short bcap_short
lib7432 bcap short bcap_short
lib7433 bcap short bcap_short

3 Inspecting data

3.1 Metric plotting & QC

Next, I looked at a few standard metrics to see whether any libraries should be excluded due to quality reasons.

# Pull out and format the subset of metrics to plot
metricSummary <- metricDat %>% 
    mutate(percentDuplication = unpaired_read_duplicates / 
               unpaired_reads_examined) %>% 
    select(libID = lib_id, 
           medianCVcoverage = median_cv_coverage, 
           fastqTotalReads = fastq_total_reads, 
           percentAligned = mapped_reads_w_dups,
           percentDuplication)

Cutoffs are set by default to standard values used in the Bioinformatics Core for three metrics; libraries are considered to have ‘failed’ QC for the following conditions:

  • < 1 million total FASTQ reads
  • < 80% aligned reads
  • > 1.0 median CV coverage

I can also adjust slider bars to look at different QC cutoffs (red lines) for the x- and y-axis; dashed lines indicate outlier limits (1.5*IQR).
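
As an illustration of how these cutoffs could be applied to the metric table, here's a minimal sketch. flagQC is a hypothetical helper, not part of the original analysis; it assumes the metricSummary columns defined above, with percentAligned expressed as a proportion:

```r
# Sketch: flag libraries failing any of the three default QC cutoffs.
# 'flagQC' is a hypothetical helper for illustration only.
flagQC <- function(metrics, minReads = 1e6, minAligned = 0.80,
                   maxMedCV = 1.0) {
    within(metrics, {
        failReads   <- fastqTotalReads < minReads
        failAligned <- percentAligned < minAligned
        failCV      <- medianCVcoverage > maxMedCV
        failQC      <- failReads | failAligned | failCV
    })
}

# Toy example: libB fails on total reads
toy <- data.frame(libID = c("libA", "libB"),
                  fastqTotalReads = c(2.0e7, 5.0e5),
                  percentAligned = c(0.95, 0.90),
                  medianCVcoverage = c(0.50, 0.60))
flagQC(toy)$failQC  # FALSE  TRUE
```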

 

3.1.1 Percent aligned

Percent aligned is simply the fraction of FASTQ reads for which there is a corresponding alignment in the TopHat BAM file. In other words, percentAligned = (# of aligned reads + all their duplicate reads, which were removed from the final BAM) / fastqTotalReads.

Percent aligned:
Percentage of aligned reads is well above the 80% cutoff for all libraries, with rates in the mid-90s across the board. lib7422 is outside the nominal outlier range, but still has 91.41% alignment.

Total reads:
While all libraries had well over 10 million reads in the input FASTQ file (after adapter trimming), lib7418 appears to be quite a bit smaller than the average of 24.27 million reads, with only 13.04 million reads.

 

3.1.2 Median CV coverage

Median CV coverage is calculated by Picard by

  1. calculating the coefficient of variation (CV) in read coverage along the length of each of the 1000 most highly expressed transcripts;
  2. calculating the median CV across these 1000 transcripts.

A high CV of read coverage suggests possible 5’ or 3’ bias, or potentially non-uniform (“bumpy” or “spiky”) coverage of a transcript. If medianCVcoverage is high (> 1), this could indicate a more systemic problem with coverage in the dataset.
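
As a toy illustration of this two-step calculation (not Picard's actual implementation):

```r
# Toy sketch of the median CV coverage calculation; Picard uses actual
# per-base coverage of the 1000 most highly expressed transcripts
set.seed(1)
covMat <- matrix(rpois(5 * 100, lambda = 50), nrow = 5)  # 5 toy transcripts

# 1. CV of read coverage along each transcript (sd / mean per row)
cvPerTranscript <- apply(covMat, 1, function(x) sd(x) / mean(x))

# 2. median CV across transcripts
medianCV <- median(cvPerTranscript)
medianCV  # small for this uniform Poisson coverage; large values = "bumpy"
```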

Median CV coverage:
All libraries look good (with medianCVcoverage generally close to 0.5) in terms of gene coverage among the top 1000 transcripts.

 

3.2 Examining read count data

I plotted the (smoothed) frequency of log-normalized counts in each library, just to get a sense of the distribution.
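
A minimal base-R sketch of this kind of density plot (the report's own plotting code isn't shown; the 0.1 offset here is an assumption, chosen because log10(0 + 0.1) = -1 matches the spike for zero-count genes described below):

```r
# Sketch: smoothed frequency of log CPM per library (manual CPM here;
# the report presumably used edgeR's cpm() -- this is illustrative only)
set.seed(2)
counts <- matrix(rnbinom(1000 * 4, mu = 10, size = 0.5), ncol = 4,
                 dimnames = list(NULL, paste0("lib", 1:4)))

# counts per million, with a small offset before taking logs
cpmMat <- t(t(counts) / colSums(counts)) * 1e6
logCpm <- log10(cpmMat + 0.1)

# smoothed frequency (density) of log CPM for each library
dens <- apply(logCpm, 2, density)
plot(dens[[1]], main = "log10(CPM + 0.1) per library")
for (d in dens[-1]) lines(d)
```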

Without any filtering, there’s a large spike at -1, representing genes with a count of 0, along with a smaller bump between 0 and 1, representing genes with very small counts. Notably, the distribution of counts for lib7418 is shifted dramatically to the left (which makes sense, given its much smaller number of input reads).

 

3.2.1 Gene filtering

I used the function defined above to build the DGEList object for the data, which is the input for downstream functions.

# Filter genes with (cpm > 10) in < 20% of samples
dge = setUpDGEList(countData = countDat, geneData = geneDat,
                      filterCount = 10, 
                      filterPercentage = 0.20)
## keepRows
## FALSE  TRUE 
## 26239 11752

Keeping only those genes with \(\geq\) 10 counts per million in at least 20% of samples (3.2 of 16 libraries), we’re left with 11752 genes.

While there are still a small fraction of genes with zero or very few counts, these no longer account for such a large percentage of genes across all libraries.

Correcting for library size, the distribution of counts per million (CPM) is more similar across all libraries, including lib7418.

 

3.2.2 Checking code edits

To verify the effect of the change I made to Elizabeth’s code above, I plotted norm.factors and effective library size (lib.size.eff) under two scenarios:

  • preFilter: norm.factors are calculated before genes are filtered
  • postFilter: norm.factors are calculated after genes are filtered and library sizes are updated

For the not-too-stringent threshold used to filter genes (CPM \(\geq\) 10 in 20% of samples), the order of operations for calculating norm.factors appears to have minimal impact on effective library sizes.
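
A sketch of how this comparison could be set up (assuming the countDat, geneDat, and setUpDGEList objects defined above; illustrative only, not the exact code used for the plots):

```r
# Sketch: compare norm.factors calculated before vs. after gene filtering
library(edgeR)

# preFilter: calculate normalization factors on the unfiltered DGEList
dgePre <- calcNormFactors(DGEList(counts = countDat, genes = geneDat))

# postFilter: filter first, then update lib sizes and normalize
# (this is what the modified setUpDGEList does)
dgePost <- setUpDGEList(countDat, geneDat,
                        filterCount = 10, filterPercentage = 0.20)

# Effective library size = lib.size * norm.factors under each scenario
effPre  <- dgePre$samples$lib.size  * dgePre$samples$norm.factors
effPost <- dgePost$samples$lib.size * dgePost$samples$norm.factors
plot(effPre, effPost); abline(0, 1, lty = 2)
```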

 

3.2.3 Gene annotation

I used the biomaRt package to add gene symbols (from MGI) corresponding to gene IDs from Ensembl.

# Get MGI gene symbols corresponding to Ensembl Gene IDs
dgeGeneDat <- dge$genes
mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
ens2Gene <- getBM(attributes = c("ensembl_gene_id", "mgi_symbol"), 
                  filters = "ensembl_gene_id", 
                  values = dgeGeneDat$ensemblID, mart = mart)
ens2Gene = ens2Gene[match(dgeGeneDat$ensemblID, ens2Gene$ensembl_gene_id), ]

# Insert MGI gene symbols for genes in DGE object gene info
dgeGeneDat <- dgeGeneDat %>% 
    mutate(mgiSymbol = ens2Gene$mgi_symbol,
           mgiSymbol = ifelse(is.na(mgiSymbol), "NA", mgiSymbol))

dge$genes <- dgeGeneDat

 

3.2.4 Principal component analysis (PCA)

I performed PCA (on log-normalized CPM values) with the prcomp function and plotted the first two principal components using the ggbiplot function from the ggbiplot package.
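
For reference, the core PCA step is compact; below is a self-contained sketch with prcomp on stand-in data (the real analysis used the log-CPM matrix and passed the prcomp result to ggbiplot):

```r
# Sketch: PCA on a (genes x samples) log-expression matrix with prcomp;
# random stand-in data here, since the real log-CPM matrix isn't shown
set.seed(42)
logCpm <- matrix(rnorm(200 * 16), nrow = 200,
                 dimnames = list(NULL, paste0("lib", 7418:7433)))

# prcomp expects observations (samples) in rows, so transpose
pca <- prcomp(t(logCpm), center = TRUE, scale. = TRUE)

# proportion of variance explained by each principal component
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
round(head(varExplained), 3)

# pca$x holds the sample coordinates (PC1, PC2, ...) used for plotting
head(pca$x[, 1:2])
```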

The second principal component (PC2) appears to explain the difference between long- and short-term HSC populations—this is clearly a strong signal in the complete dataset. PC1, on the other hand, is not obviously correlated with any experimental group or any of the metrics I examined above.

Among the first two principal components, there is no clear clustering by KO status.

Looking at some of the other top principal components, PC4 appears to have the strongest (though not perfect) relationship with KO status. When PC4 is plotted against PC2 (related to HSC population), the samples separate reasonably well into the expected groups. PC3 seems to have almost an inverse relationship with PC2.

 

3.2.5 Possible sources of confounding

To try to get a better sense of what PC1 might represent in the data, I plotted several experimental, sequencing, and alignment variables against PC1 values.

While PC1 does appear to show some relationship with KO status, the two extreme libraries on the PC1 axis (lib7418 and lib7427) both also have very low RNA concentrations (ng_per_ul). None of the other variables seem to be particularly associated with PC1.

 

I also plotted the same set of variables against koStatus.

None of the variables jump out to me as being obviously correlated with koStatus, with the exception of mouse age (age_in_weeks). Because the KO mice are all at least half a week older than the WT mice, I don’t believe there’s any way to fully rule out age as a source of expression changes. However, the age difference is small enough that I expect its effects to be relatively small.

Despite the slope of some of the fitted lines, none of the RNA-seq quality metrics appear to be particularly skewed by koStatus.


4 Differential expression analysis

4.1 Building models

Design 1: expression ~ koStatus

The first, possibly most naive, model I trained was predicting expression as a function of BCAP knockout status (i.e., koStatus).

# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesign <- model.matrix(~ koStatus, data = groupDat)
koVoom <- voomWithQualityWeights(dge, design = koDesign,
                                    plot = TRUE)

Using voomWithQualityWeights from the limma package, several libraries are downweighted in the normalized data object. The two libraries with the lowest sample weights are once again those with low ng_per_ul values (and also with extreme PC1 values).

# Fit model for the group design
koFit <- lmFit(koVoom, koDesign)
koFit <- eBayes(koFit)
koResults <- topTable(koFit, number = nrow(dge))

This model produces 10 significant genes:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025035 Arl3 1.33 5.5 8.1 0 0.003 6.4
ENSMUSG00000092250 Gm20467 -0.78 5.8 -6.4 0 0.021 4.1
ENSMUSG00000040105 Ppapdc2 -1.69 5.1 -6.8 0 0.016 3.9
ENSMUSG00000026104 Stat1 -0.45 7.8 -5.7 0 0.043 2.7
ENSMUSG00000025793 Hgs 1.07 6.6 5.6 0 0.043 2.7
ENSMUSG00000038418 Egr1 0.67 7.9 5.7 0 0.043 2.7
ENSMUSG00000022412 Mief1 0.71 6.7 5.5 0 0.046 2.5
ENSMUSG00000003545 Fosb 1.43 5.4 5.6 0 0.043 2.4
ENSMUSG00000034189 Hsdl1 0.62 7.5 5.4 0 0.050 2.2
ENSMUSG00000047181 Samd14 1.18 4.8 5.6 0 0.043 2.1

logFC in this table represents the log fold-change in expression per unit change in koStatus (in other words, the log FC in wild-type relative to KO).
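
This direction can be double-checked from the design matrix column names: because 'bcap' precedes 'wt' alphabetically, 'bcap' is the reference level. A toy check (not output from the actual analysis):

```r
# Toy check of the koStatus coefficient's direction
toyGroups <- data.frame(koStatus = factor(rep(c("wt", "bcap"), each = 2)))
mm <- model.matrix(~ koStatus, data = toyGroups)
colnames(mm)  # "(Intercept)" "koStatuswt"

# 'bcap' is the (alphabetically first) reference level, so a positive
# logFC for the koStatuswt coefficient means higher expression in WT
# relative to KO
```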

 

4.1.1 A note about BCAP

Interestingly, the gene for BCAP (i.e., the knock-out target), Pik3ap1, is not significant after multiple testing correction:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025017 Pik3ap1 1.1 5.9 2.6 0.019 0.34 -3.3

Based on previous inspection of the aligned data in IGV, the knockout of BCAP does not manifest entirely cleanly on the RNA level. As seen in the plots below, BCAP has at least low counts in long-term HSCs and moderate counts in short-term HSCs, even in libraries from KO mice.

 

Design 2: expression ~ koStatus + hscPop

Because HSC population (i.e., hscPop: long- vs. short-term) accounts for such a large fraction of variance in the data, it makes sense to control for this variable by including it as a parameter in the model. We lose a degree of freedom in the process, but retain more degrees of freedom than if we were to subset the data and analyze the long- and short-term populations separately.

# Define the design matrix, including terms corresponding to KO status and HSC
# population; use voom to calculate transformed expression values
koHscDesign <- model.matrix(~ koStatus + hscPop, data = groupDat)
koHscVoom <- voomWithQualityWeights(dge, design = koHscDesign,
                                    plot = TRUE)

# Fit model for the group design
koHscFit <- lmFit(koHscVoom, koHscDesign)
koHscFit <- eBayes(koHscFit)
koHscResults <- topTable(koHscFit, coef = 2, number = nrow(dge))

While the model includes parameters for both koStatus and hscPop, we’re most interested in the effect of koStatus on expression. By examining the results for the 2nd model coefficient (1st coefficient is the intercept), we can interpret logFC in the table below as the log FC in expression in WT relative to KO mice, when all other parameters are held constant.

This model produces 14 significant genes for koStatus:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025035 Arl3 1.32 5.5 8.4 0 0.002 7.2
ENSMUSG00000064215 Ifi27 -0.64 7.3 -7.3 0 0.007 5.5
ENSMUSG00000040105 Ppapdc2 -1.72 5.1 -6.7 0 0.015 4.3
ENSMUSG00000092250 Gm20467 -0.77 5.8 -6.5 0 0.016 4.2
ENSMUSG00000026104 Stat1 -0.45 7.8 -6.4 0 0.017 3.7
ENSMUSG00000003545 Fosb 1.38 5.4 6.0 0 0.028 3.2
ENSMUSG00000032690 Oas2 -1.28 5.8 -5.8 0 0.028 3.0
ENSMUSG00000047181 Samd14 1.22 4.8 5.9 0 0.028 2.9
ENSMUSG00000022412 Mief1 0.72 6.7 5.8 0 0.028 2.9
ENSMUSG00000034189 Hsdl1 0.61 7.5 5.7 0 0.034 2.4
ENSMUSG00000025793 Hgs 1.08 6.6 5.6 0 0.038 2.4
ENSMUSG00000039738 Slx4 -0.97 5.4 -5.4 0 0.038 2.2
ENSMUSG00000066877 Nck2 1.21 4.8 5.4 0 0.038 2.1
ENSMUSG00000038418 Egr1 0.66 7.9 5.5 0 0.038 1.9

 

Pik3ap1 is still not significant after multiple testing correction:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025017 Pik3ap1 0.97 5.9 4.4 0 0.087 0.14

 

4.1.2 Surrogate variable analysis (SVA)

To dig a bit further into PC1 / ng_per_ul, which might be introducing unwanted noise into the model, I identified surrogate variables using the sva package. Surrogate variables represent sources of variance in the data, not accounted for by the primary design variables (i.e., the factors of interest); SVA is commonly used to identify and remove potential confounding variables when there may not be clear “batches” among samples.

Surrogate variables were computed on the normalized data with the model matrix for the ~ koStatus + hscPop design.

As shown in the plots below, PC1, SV1, and also the voom sample quality weights all appear to be correlated with RNA concentration (ng_per_ul).

koSva <- sva(koHscVoom$E, koHscVoom$design)
## Number of significant surrogate variables is:  3 
## Iteration (out of 5 ):1  2  3  4  5

 

Design 3: expression ~ koStatus + hscPop + ng_per_ul

To control for RNA concentration, I added the ng_per_ul as a parameter in the model.

# Define the design matrix, including terms corresponding to KO status, HSC
# population, and RNA concentration; use voom to calculate transformed 
# expression values
koHscConcDesign <- model.matrix(~ koStatus + hscPop + ng_per_ul, 
                                data = confoundDat)
koHscConcVoom <- voomWithQualityWeights(dge, design = koHscConcDesign,
                                    plot = TRUE)

Notably, voom still down-weights the libraries with low ng_per_ul, suggesting that the “poor” quality in these samples is not fully explained by the RNA concentration variable.

# Fit model for the group design
koHscConcFit <- lmFit(koHscConcVoom, koHscConcDesign)
koHscConcFit <- eBayes(koHscConcFit)
koHscConcResults <- topTable(koHscConcFit, coef = 2, number = nrow(dge))

This new model produces 274 significant genes for koStatus (too many to print here in full). Among these genes, 57 showed a greater than 2-fold change (|logFC| > 1) in expression between KO and WT:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025035 Arl3 1.3 5.47 8.8 0.000 0.002 6.528
ENSMUSG00000036478 Btg1 1.1 6.38 7.8 0.000 0.003 6.049
ENSMUSG00000060126 Tpt1 1.2 5.82 7.4 0.000 0.003 5.173
ENSMUSG00000032690 Oas2 -1.3 5.82 -7.2 0.000 0.003 5.109
ENSMUSG00000003545 Fosb 1.5 5.41 6.6 0.000 0.006 3.571
ENSMUSG00000038872 Zfhx3 1.0 6.80 6.2 0.000 0.007 3.505
ENSMUSG00000022637 Cblb 1.1 7.31 6.1 0.000 0.007 3.314
ENSMUSG00000048154 Kmt2d 1.6 5.65 6.2 0.000 0.007 3.188
ENSMUSG00000040105 Ppapdc2 -1.7 5.05 -6.4 0.000 0.007 2.898
ENSMUSG00000029438 Bcl7a 1.2 6.58 5.7 0.000 0.010 2.661
ENSMUSG00000047181 Samd14 1.3 4.79 6.2 0.000 0.007 2.525
ENSMUSG00000025793 Hgs 1.1 6.59 5.5 0.000 0.012 2.277
ENSMUSG00000066877 Nck2 1.3 4.83 6.0 0.000 0.007 2.244
ENSMUSG00000026064 Ptp4a1 1.0 6.21 5.4 0.000 0.012 2.102
ENSMUSG00000028152 Tspan5 1.2 5.89 5.4 0.000 0.012 2.041
ENSMUSG00000091953 NA 1.1 5.08 5.5 0.000 0.012 1.875
ENSMUSG00000038170 Pde4dip 1.0 6.38 5.3 0.000 0.014 1.799
ENSMUSG00000092060 Bend4 1.0 5.58 5.2 0.000 0.014 1.659
ENSMUSG00000081594 Gm15467 1.1 4.95 5.4 0.000 0.013 1.550
ENSMUSG00000026113 Inpp4a 1.1 5.16 5.3 0.000 0.014 1.505
ENSMUSG00000034522 Zfp395 1.3 5.73 5.1 0.000 0.015 1.469
ENSMUSG00000033126 Ybey 1.4 4.47 5.7 0.000 0.011 1.462
ENSMUSG00000007670 Khsrp 1.5 5.39 5.2 0.000 0.015 1.440
ENSMUSG00000034300 Fam53c 1.1 6.61 5.0 0.000 0.016 1.394
ENSMUSG00000021962 Dcp1a -1.0 5.49 -5.0 0.000 0.016 1.269
ENSMUSG00000091219 NA 1.4 5.31 5.1 0.000 0.016 1.233
ENSMUSG00000005533 Igf1r 1.5 5.81 4.9 0.000 0.018 1.078
ENSMUSG00000070576 Mn1 1.1 5.18 5.0 0.000 0.017 1.034
ENSMUSG00000089774 Slc5a3 1.3 5.58 4.7 0.000 0.021 0.726
ENSMUSG00000046591 Ticrr -1.3 5.10 -4.7 0.000 0.021 0.680
ENSMUSG00000054945 Gm9958 1.2 4.84 4.7 0.000 0.023 0.376
ENSMUSG00000025017 Pik3ap1 1.1 5.93 4.5 0.000 0.027 0.278
ENSMUSG00000059149 Mfsd4 2.8 5.22 4.6 0.000 0.023 0.197
ENSMUSG00000006344 Ggt5 1.1 5.57 4.4 0.000 0.029 0.149
ENSMUSG00000048264 Dip2c 1.2 5.60 4.4 0.000 0.029 0.134
ENSMUSG00000037336 Mfsd2b -1.4 5.40 -4.4 0.000 0.030 0.107
ENSMUSG00000090613 NA 1.0 6.17 4.4 0.000 0.030 0.100
ENSMUSG00000032035 Ets1 1.2 5.94 4.3 0.001 0.034 -0.053
ENSMUSG00000011267 Zfp296 -1.6 4.23 -4.6 0.000 0.023 -0.133
ENSMUSG00000085013 NA 1.1 4.78 4.2 0.001 0.035 -0.352
ENSMUSG00000086985 NA -1.0 4.89 -4.2 0.001 0.036 -0.356
ENSMUSG00000039126 Prune2 -1.9 4.71 -4.4 0.000 0.030 -0.399
ENSMUSG00000069056 NA 1.7 4.00 4.5 0.000 0.025 -0.486
ENSMUSG00000003233 Dvl3 1.6 4.09 4.5 0.000 0.026 -0.529
ENSMUSG00000022066 Gm21685 1.4 4.82 4.1 0.001 0.039 -0.549
ENSMUSG00000090877 Hspa1b -1.4 3.81 -4.0 0.001 0.046 -0.920
ENSMUSG00000086869 NA 1.5 3.41 4.5 0.000 0.026 -1.052
ENSMUSG00000055137 Sugct 1.2 4.00 4.0 0.001 0.045 -1.090
ENSMUSG00000074911 NA 1.6 3.26 4.5 0.000 0.027 -1.295
ENSMUSG00000038763 Alpk3 1.7 3.62 4.1 0.001 0.042 -1.403
ENSMUSG00000044288 Cnr1 3.4 1.94 4.1 0.001 0.042 -2.287
ENSMUSG00000032372 Plscr2 -3.8 2.61 -4.3 0.001 0.032 -2.302
ENSMUSG00000037849 Gm4955 -2.6 1.27 -4.0 0.001 0.043 -2.366
ENSMUSG00000001943 Vsig2 -2.5 1.94 -4.5 0.000 0.026 -2.596
ENSMUSG00000064202 4430402I18Rik -3.0 1.39 -4.0 0.001 0.044 -2.678
ENSMUSG00000025213 Kazald1 5.2 1.47 5.2 0.000 0.015 -3.066
ENSMUSG00000024935 Slc1a1 -5.6 0.31 -4.2 0.001 0.036 -3.310
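
The tallies quoted for this design (274 significant genes, 57 of them with |logFC| > 1) can be reproduced from any topTable output with a filter along these lines (toy data here, using topTable's standard column names):

```r
# Sketch: count significant genes and those with |logFC| > 1,
# using a toy data frame with topTable's standard columns
toyResults <- data.frame(
    logFC     = c(1.3, -0.4, 2.1, -1.7, 0.2),
    adj.P.Val = c(0.002, 0.03, 0.20, 0.01, 0.60)
)

sig <- subset(toyResults, adj.P.Val < 0.05)
nrow(sig)                          # 3 significant genes
nrow(subset(sig, abs(logFC) > 1))  # 2 of them with |logFC| > 1
```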

 

Pik3ap1 is significant after multiple testing correction under this design (adj. p-value = 0.027; see the Pik3ap1 row in the table above).

While, for the reasons stated above, differential BCAP expression should probably not be viewed as a positive control, it’s at least somewhat encouraging to see the model better account for this difference.

 

Design 3.5: expression ~ koStatus + hscPop + ng_per_ul, no quality weighting of samples

Because I’ve accounted for the most obvious source of the variance explained by PC1—and because ng_per_ul also appears to be correlated with the sample quality weights from voom—I opted to train an additional model without the quality weighting.

# Define the design matrix, including terms corresponding to KO status, HSC
# population, and RNA concentration; use voom to calculate transformed 
# expression values
koHscConcVoomNoQual <- voom(dge, design = koHscConcDesign, plot = TRUE)

# Fit model for the group design
koHscConcNoQualFit <- lmFit(koHscConcVoomNoQual, koHscConcDesign)
koHscConcNoQualFit <- eBayes(koHscConcNoQualFit)
koHscConcNoQualResults <- topTable(koHscConcNoQualFit, coef = 2, number = nrow(dge))

This new model produces 201 significant genes for koStatus.

 

The decrease in significant genes seems to suggest that there is enough difference in library quality (or possibly some other source of noise)—even when accounting for ng_per_ul—that not using voom’s quality weighting leads to a worse model.

Pik3ap1 is again significant after multiple testing correction:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025017 Pik3ap1 1 5.9 4.4 0 0.032 0.25

 

4.2 Model (design) comparison

The plot below shows the full list of genes that were found to be significant with one or more of the above model designs. For each gene, a colored bar indicates the design under which it was found to be significant (the size of the bar is scaled by the adjusted p-value). Note: the plot is intended to display overlap between the model results, so gene labels are not shown.

The gene lists for designs 1 and 2 overlap completely with designs 3 and 3.5; there were 5 genes from design 3.5 that were not significant in design 3.

Genes with significantly different expression (adj. p-value < 0.05) as a function of BCAP KO status:

 

Below is a volcano plot for the Design 3 model results. Yellow markers denote genes with adjusted p-value < 0.05; faded markers between the vertical dashed lines represent genes with absolute fold change \(\leq\) 2 (|logFC| \(\leq\) 1).
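
A hedged sketch of how such a volcano plot could be drawn with ggplot2 (toy data; column names follow limma's topTable convention, and the report's actual plotting code isn't shown):

```r
# Sketch: volcano plot from topTable-style results (toy data)
library(ggplot2)
set.seed(7)
res <- data.frame(logFC = rnorm(500, sd = 1.5),
                  adj.P.Val = runif(500))

ggplot(res, aes(x = logFC, y = -log10(adj.P.Val))) +
    geom_point(aes(colour = adj.P.Val < 0.05, alpha = abs(logFC) > 1)) +
    geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
    scale_colour_manual(values = c("grey50", "goldenrod")) +
    scale_alpha_manual(values = c(0.3, 1))
```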


5 Examining HSC populations separately

Just to check, I also tested for differential expression with only libraries from the same HSC population included. For both long- and short-term HSCs, there were no significant genes for koStatus.

5.1 Long-term HSC mice only

# Get data for libraries from long-term HSC population
groupDatLong <- groupDat %>% 
    filter(hscPop == "long")
countDatLong <- countDat[, names(countDat) %in% groupDatLong$libID]

# Filter genes with (cpm > 10) in < 20% of samples
dgeLong = setUpDGEList(countData = countDatLong, geneData = geneDat,
                      filterCount = 10, 
                      filterPercentage = 0.20)
## keepRows
## FALSE  TRUE 
## 26152 11839

 

Design: expression ~ koStatus

# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesignLong <- model.matrix(~ koStatus, data = groupDatLong)
koVoomLong <- voomWithQualityWeights(dgeLong, design = koDesignLong,
                                    plot = TRUE)

# Fit model for the group design
koFitLong <- lmFit(koVoomLong, koDesignLong)
koFitLong <- eBayes(koFitLong)
koResultsLong <- topTable(koFitLong, number = nrow(dgeLong))

This model produces 0 significant genes for koStatus.

 

5.2 Short-term HSC mice only

# Get data for libraries from short-term HSC population
groupDatShort <- groupDat %>% 
    filter(hscPop == "short")
countDatShort <- countDat[, names(countDat) %in% groupDatShort$libID]

# Filter genes with (cpm > 10) in < 20% of samples
dgeShort = setUpDGEList(countData = countDatShort, geneData = geneDat,
                      filterCount = 10, 
                      filterPercentage = 0.20)
## keepRows
## FALSE  TRUE 
## 26097 11894

 

Design: expression ~ koStatus

# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesignShort <- model.matrix(~ koStatus, data = groupDatShort)
koVoomShort <- voomWithQualityWeights(dgeShort, design = koDesignShort,
                                    plot = TRUE)

# Fit model for the group design
koFitShort <- lmFit(koVoomShort, koDesignShort)
koFitShort <- eBayes(koFitShort)
koResultsShort <- topTable(koFitShort, number = nrow(dgeShort))

This model produces 0 significant genes for koStatus.


6 Session info

sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] shiny_0.12.2      knitr_1.11        reshape2_1.4.1   
##  [4] readxl_0.1.0.9000 readr_0.1.1       stringr_1.0.0    
##  [7] ggthemes_2.2.1    dplyr_0.4.2       sva_3.14.0       
## [10] genefilter_1.50.0 mgcv_1.8-7        nlme_3.1-121     
## [13] cowplot_0.5.0     ggbiplot_0.55     scales_0.2.5     
## [16] plyr_1.8.3        ggplot2_1.0.1     biomaRt_2.24.0   
## [19] edgeR_3.10.2      limma_3.24.15    
## 
## loaded via a namespace (and not attached):
##  [1] splines_3.2.1        lattice_0.20-33      colorspace_1.2-6    
##  [4] htmltools_0.2.6      stats4_3.2.1         yaml_2.1.13         
##  [7] XML_3.98-1.3         survival_2.38-3      DBI_0.3.1           
## [10] BiocGenerics_0.14.0  munsell_0.4.2        gtable_0.1.2        
## [13] codetools_0.2-14     evaluate_0.7.2       labeling_0.3        
## [16] Biobase_2.28.0       IRanges_2.2.7        httpuv_1.3.3        
## [19] GenomeInfoDb_1.4.2   parallel_3.2.1       AnnotationDbi_1.30.1
## [22] highr_0.5            proto_0.3-10         Rcpp_0.12.0         
## [25] xtable_1.7-4         formatR_1.2          S4Vectors_0.6.3     
## [28] annotate_1.46.1      mime_0.3             digest_0.6.8        
## [31] stringi_0.5-5        tools_3.2.1          bitops_1.0-6        
## [34] magrittr_1.5         RCurl_1.95-4.7       lazyeval_0.1.10     
## [37] RSQLite_1.0.0        MASS_7.3-43          Matrix_1.2-2        
## [40] assertthat_0.1       rmarkdown_0.7        R6_2.1.1